Syntactic Segmentation and Labeling of Digitized Pages from Technical Journals

Authors

Mukkai S. Krishnamoorthy

George Nagy

Sharad C. Seth

Mahesh Viswanathan

Abstract

Alternating horizontal and vertical projection profiles are extracted from nested sub-blocks of scanned page images of technical documents. The thresholded profile strings are parsed using the compiler utilities Lex and Yacc. The significant document components are demarcated and identified by the recursive application of block grammars. Backtracking for error recovery and branch and bound for maximum-area labeling are implemented with Unix Shell programs. Results of the segmentation and labeling process are stored in a labeled X-Y tree. It is shown that families of technical documents that share the same layout conventions can be readily analyzed. More than 20 types of document entities can be identified in sample pages from the IBM Journal of Research and Development and IEEE Transactions on Pattern Analysis and Machine Intelligence. Potential applications include preprocessors for optical character recognition, document archival, and digital reprographics.
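
The recursive, alternating projection-profile idea in the abstract can be illustrated with a minimal sketch. It assumes a binary page image stored as nested Python lists (1 = black pixel). The names `profile_string`, `black_runs`, `xy_cut`, and `leaves` are hypothetical and do not come from the paper. A simple run-length split stands in for the Lex/Yacc block grammars, and no entity labels are assigned, so this shows only the segmentation into an X-Y tree, not the authors' labeling method.

```python
from itertools import groupby


def profile_string(img, top, bottom, left, right, horizontal, threshold=0):
    """Threshold a projection profile of a block into a string of 'b'/'w'.

    horizontal=True gives one symbol per row; False gives one per column.
    """
    if horizontal:
        sums = [sum(img[r][left:right]) for r in range(top, bottom)]
    else:
        sums = [sum(img[r][c] for r in range(top, bottom))
                for c in range(left, right)]
    return "".join("b" if s > threshold else "w" for s in sums)


def black_runs(symbols, min_gap):
    """Return (start, end) spans of 'b' runs, merging across short white gaps."""
    spans, pos = [], 0
    for sym, group in groupby(symbols):
        length = len(list(group))
        if sym == "b":
            if spans and pos - spans[-1][1] < min_gap:
                spans[-1] = (spans[-1][0], pos + length)  # gap too small: merge
            else:
                spans.append((pos, pos + length))
        pos += length
    return spans


def xy_cut(img, top, bottom, left, right, horizontal=True, min_gap=2,
           tried_other=False):
    """Recursively split a block, alternating cut direction, into an X-Y tree."""
    symbols = profile_string(img, top, bottom, left, right, horizontal)
    spans = black_runs(symbols, min_gap)
    node = {"box": (top, bottom, left, right), "children": []}
    if not spans:
        return node  # block contains no black pixels
    if spans == [(0, len(symbols))]:
        if tried_other:
            return node  # no cut possible in either direction: leaf block
        # No cut in this direction; try the other direction once.
        return xy_cut(img, top, bottom, left, right, not horizontal,
                      min_gap, True)
    for start, end in spans:
        if horizontal:
            child = xy_cut(img, top + start, top + end, left, right,
                           False, min_gap)
        else:
            child = xy_cut(img, top, bottom, left + start, left + end,
                           True, min_gap)
        node["children"].append(child)
    return node


def leaves(node):
    """Collect the bounding boxes (top, bottom, left, right) of leaf blocks."""
    if not node["children"]:
        return [node["box"]]
    return [box for child in node["children"] for box in leaves(child)]


# Toy page: a full-width "title" band, a white gap, then two "columns".
page = ([[1] * 10] * 2) + ([[0] * 10] * 2) + ([[1] * 4 + [0] * 2 + [1] * 4] * 4)
tree = xy_cut(page, 0, 8, 0, 10)
print(leaves(tree))
```

On the toy page the first horizontal cut separates the title band from the body, and the following vertical cut separates the two columns. In the system the abstract describes, each thresholded profile string would instead be parsed by a block grammar that also assigns labels to the resulting blocks.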

Similar resources

Semantic Role Labeling of Persian Sentences Using a Memory-Based Learning Approach

Extracting semantic roles is one of the major steps in representing text meaning. It refers to finding the semantic relations between a predicate and syntactic constituents in a sentence. In this paper we present a semantic role labeling system for Persian, using a memory-based learning model and standard features. Our proposed system implements a two-phase architecture to first identify...

Automatic Semantic Role Labeling of Persian Sentences Using Dependency Trees

Automatic identification of words with semantic roles (such as Agent, Patient, Source, etc.) in sentences and attaching correct semantic roles to them may lead to improvements in many natural language processing tasks, including information extraction, question answering, text summarization and machine translation. Semantic role labeling systems usually take advantage of syntactic parsing and th...

A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling

In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...

CzEng 0.9: Large Parallel Treebank with Rich Annotation

We describe our ongoing efforts in collecting a Czech-English parallel corpus CzEng. The paper provides full details on the current version 0.9 and focuses on its new features: (1) data from new sources were added, most importantly a few hundred electronically available books, technical documentation and also some parallel web pages, (2) the full corpus has been automatically annotated up to th...

Journal:

IEEE Trans. Pattern Anal. Mach. Intell.

Volume 15

Publication date: 1993
